What is Vinho Verde?

The data set, from 2008, comprises white wines known as Vinho Verde (green wine). Vinho Verde (VV) refers to the Minho region of northern Portugal; it is not a type of grape. This region is known for being cooler and wetter than the rest of the country in the winter, although it does get hot in the summer.

VV wines have traditionally been known for being light, crisp and low in alcohol, and were meant to be drunk young (when they’re “green”). Lately, however, VV’s reputation as a cheap & cheerful wine has begun to change.

Within this region, there are nine sub-regions.

[Map of the Vinho Verde sub-regions]

As you can see from the map, the topography of the VV region is varied, as is that of several of its subregions. Vinho Verde wines can come from the Atlantic coast, the mountains, or inland plains. Soils vary, as do the types of grapes grown.

About the data

How large is this data set?

## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The data is either integer or numeric.

Alas, none of the categorical information mentioned above, such as soil type, grape variety, or subregion, is included in this data set.

Since we need at least one categorical variable for this dataset, let’s convert quality.

## [1] "3" "4" "5" "6" "7" "8" "9"
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol      quality 
##  Min.   : 8.00   3:  20  
##  1st Qu.: 9.50   4: 163  
##  Median :10.40   5:1457  
##  Mean   :10.51   6:2198  
##  3rd Qu.:11.40   7: 880  
##  Max.   :14.20   8: 175  
##                  9:   5

There are no NA values. Only citric.acid has zero values. The scale of values varies widely, from the hundreds for total.sulfur.dioxide down to the thousandths for chlorides.
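Here is a minimal sketch of how these two checks could be run in base R. The small data frame `toy` is invented for illustration and is not the wine data; on the real data the same two calls would be applied to `whites`.

```r
# Toy stand-in for the wine data (values invented for illustration)
toy <- data.frame(
  citric.acid = c(0.36, 0.00, 0.40, 0.32),
  chlorides   = c(0.045, 0.049, 0.050, 0.058),
  alcohol     = c(8.8, 9.5, 10.1, 9.9)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of exact zeros in each column
zero_counts <- colSums(toy == 0)

na_counts
zero_counts
```

In the toy data only citric.acid contains a zero, which mirrors what the summary above shows for the real data.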

Univariate plots

The first thing I wanted to look at was the quality.

quality

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Most wines are average: 74.6% of wines earned a score of 5 or 6.

By contrast, only:

  • 5 wines earned a “9”
  • 20 wines earned a “3”
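The percentages can be reproduced from the counts alone. This is a small sketch that types the table in by hand, so the vector name `quality_counts` is my own and not an object from the analysis.

```r
# Counts per quality score, copied from the table above
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Share of each score, in percent
quality_pct <- 100 * quality_counts / sum(quality_counts)

# Combined share of the "average" scores 5 and 6
avg_share <- sum(quality_pct[c("5", "6")])
round(avg_share, 1)      # 74.6

# Combined share of the two extreme scores, 3 and 9
extreme_share <- sum(quality_pct[c("3", "9")])
round(extreme_share, 2)  # 0.51
```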

alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol content ranges from 8 to 14.2% by volume.

The distribution looks unusual, not just bimodal but multimodal. The main mode is at 9.5%, but there are smaller, ‘local’ modes at 10.5% and 12% as well.

Dip at 11.5%

Regardless of binwidth, there was a clear dip at 11.5%. Why? Here is the likely explanation (it also explains the modes, max and min values):

The alcohol level of ‘generic’ Vinho Verde must lie between 8% and 11.5% ABV. However, if the wine is labelled with one of the nine sub-regions, which specialise in particular grape varieties, the range extends from 9% to 14% ABV. Additionally, Vinho Verde made from the single varietal Alvarinho can be between 11.5% and 14% ABV. -https://www.alcoholprofessor.com/blog/2014/04/23/vinho-verde-a-splash-of-summer-vinous-joy/
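The label rules quoted above can be written down as a small lookup. This is only a sketch: the function name `allowed_labels` and the three label names are my own, and I am assuming the ranges are inclusive at both ends.

```r
# Which label categories permit a given alcohol level (% ABV)?
# Ranges are taken from the quote above; inclusive bounds are an assumption.
allowed_labels <- function(abv) {
  rules <- data.frame(
    label   = c("generic", "sub-region", "Alvarinho"),
    min_abv = c(8, 9, 11.5),
    max_abv = c(11.5, 14, 14),
    stringsAsFactors = FALSE
  )
  rules$label[abv >= rules$min_abv & abv <= rules$max_abv]
}

allowed_labels(8.5)   # only the generic label applies
allowed_labels(11.5)  # all three ranges meet at 11.5%
allowed_labels(12)    # sub-region or Alvarinho
allowed_labels(14.2)  # none of the quoted ranges covers the data set's maximum
```

The three ranges overlap, and 11.5% is the one value where all of them meet, which fits with something unusual happening in the histogram around that point.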

Are alcohol values continuous or discrete?

ggplot(aes(alcohol), data=whites) +
  geom_bar(color=I('green')) + 
  scale_x_continuous(breaks=seq(8,14.5,.5))

Alcohol content is usually listed to a tenth of a percentage point. Whole numbers and values ending in .5 are more common than others.

fixed.acidity

In wine tasting, the term “acidity” refers to the fresh, tart and sour attributes of the wine which are evaluated in relation to how well the acidity balances out the sweetness and bitter components of the wine such as tannins. Three primary acids are found in wine grapes: tartaric, malic and citric acids. — http://winemaking.jackkeller.net/acid.asp

fixed.acidity in our data set only refers to tartaric acid, the most predominant acid in wine. It helps stabilize a wine’s chemical make-up and its colour. It also contributes to taste.

It is measured in g/dm^3, which is equivalent to g/l. Multiply this value by 0.1 to get grams per 100 ml, i.e. the percentage by weight/volume.

How to interpret fixed acidity values:

  • 0.4% (4 g/l) is considered flat.
  • 1% (10 g/l) is considered too tart to be drinkable.
  • Most table wine is between 0.6% and 0.7%.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200
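To put the summary into the percentage terms used above, here is a rough sketch. The values are typed in from the output, the helper `label_fa` is hypothetical, and treating "flat" as 0.4% or less and "too tart" as 1% or more is my own reading of the thresholds.

```r
# Summary of fixed.acidity in g/l, copied from the output above
fa_gl <- c(Min = 3.8, Q1 = 6.3, Median = 6.8, Mean = 6.855, Q3 = 7.3, Max = 14.2)

# g/l -> grams per 100 ml, i.e. percent
fa_pct <- fa_gl * 0.1

# Rough bands based on the thresholds listed above;
# "in between" is my own label for values outside the named bands
label_fa <- function(pct) {
  ifelse(pct <= 0.4, "flat",
  ifelse(pct >= 1.0, "too tart",
  ifelse(pct >= 0.6 & pct <= 0.7, "typical table wine", "in between")))
}

data.frame(g_per_l = fa_gl, percent = fa_pct, band = label_fa(fa_pct))
```

Read this way, the quartiles and the median sit in or just above the table-wine band, while the minimum and maximum land in the "flat" and "too tart" bands respectively.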

fixed.acidity has a very slight positive skew (the mean, 6.855, sits just above the median, 6.8), but it is hard to see clearly on this plot. There seem to be some negative (low-end) outliers too.

Let’s try a boxplot instead that also shows the underlying points.

Indeed, there are both negative and positive outliers, but more positive ones and they also extend further. Call this a near-normal distribution.

Do these wines have a higher fixed acidity, in general?

Let’s treat this as a normal distribution and take the median +/- 1 standard deviation to see where most fixed acidity values fall.

fa.sd <- sd(whites$fixed.acidity)
top <- median(whites$fixed.acidity) + fa.sd
bottom <- median(whites$fixed.acidity) - fa.sd

Roughly 68% of fixed.acidity values should fall between 0.596% and 0.764%. In other words, the upper end of the range is slightly higher than the 0.6% to 0.7% typical of table wine.
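The arithmetic behind that range can be retraced without the data. In this sketch the standard deviation is not taken from the analysis directly: I back-calculated it from the reported range and the median, so treat it as approximate. The 68% figure strictly describes the mean +/- 1 sd of a normal distribution, so the mean-centred interval is shown alongside for comparison.

```r
fa_median <- 6.8        # from the summary output, g/l
fa_mean   <- 6.855      # from the summary output, g/l
fa_sd     <- 0.843868   # back-calculated from the reported range (assumption)

# Median +/- 1 sd, converted from g/l to percent
range_median_pct <- (fa_median + c(-1, 1) * fa_sd) * 0.1

# Mean +/- 1 sd, the interval the 68% rule actually refers to
range_mean_pct <- (fa_mean + c(-1, 1) * fa_sd) * 0.1

# Share of a normal distribution lying within 1 sd of its mean
share_1sd <- pnorm(1) - pnorm(-1)

round(range_median_pct, 3)  # 0.596 0.764
round(range_mean_pct, 3)    # 0.601 0.770
round(share_1sd, 3)         # 0.683
```

The two intervals differ only in the third decimal place, so centring on the median rather than the mean does not change the conclusion here.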

volatile.acidity

Volatile acidity (VA) is primarily a measure of the presence of acetic acid. While a small amount is a natural by-product of fermentation, exposure to oxygen converts alcohol to acetic acid, which is known as oxidization. Too much acetic acid creates a vinegar taste in wine.

A VA of 0.03-0.06% is produced during fermentation and is considered a normal level. (source: http://www.wineperspective.com/the_acidity_of_wine.htm)

Since volatile.acidity is in g/l, multiply the values by 0.1 to calculate the %.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The distribution is positively skewed, with a long right tail. Let’s transform the x-axis to have a better look at the long-tail values.

While most values are less than 0.03%, there are quite a few outliers above 0.06%, with the highest reaching 0.11%.
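As a quick check of those percentages, the summary values can be converted and compared with the 0.03% to 0.06% band quoted earlier. This sketch uses only the six numbers from the summary output, typed in by hand, and the vector names are my own.

```r
# Summary of volatile.acidity in g/l, copied from the output above
va_gl <- c(Min = 0.08, Q1 = 0.21, Median = 0.26, Mean = 0.2782,
           Q3 = 0.32, Max = 1.10)

# g/l -> percent
va_pct <- va_gl * 0.1

# "Normal" band produced during fermentation, per the source quoted above
normal_band <- c(low = 0.03, high = 0.06)

below_band <- va_pct < normal_band[["low"]]
above_band <- va_pct > normal_band[["high"]]

names(va_pct)[below_band]  # statistics that fall under the band
names(va_pct)[above_band]  # statistics that fall over the band
```

Of the six summary statistics, only the maximum lies above the band, while the minimum, first quartile, median and mean all sit below it.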

citric.acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The distribution for citric acid is close to normal, with fewer positive outliers than the previous variables examined.

After reducing the binwidth I noticed some strange spikes at 0, 0.5 and 0.7g/l, and even at 1g/l.

Here is a possible explanation:

In the European Union, use of citric acid for acidification is prohibited, but limited use of citric acid is permitted for removing excess iron and copper from the wine if potassium ferrocyanide is not available. — https://en.wikipedia.org/wiki/Acids_in_wine#Citric_acid

And this is from a 2003 export agreement between Canada and the EU:

  1. addition of citric acid for wine stabilisation purposes, provided that the final content in the treated wine does not exceed 1 g/l, — http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52003PC0377 (section B, item 15)

pH

This is a test of how strong the acidity is. Wines typically have a pH between 2.9 and 3.9. The lower the pH, the more acidic (instead of basic) the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

In this dataset pH’s distribution is a near-perfect bell curve.

The pH values range from 2.72 to 3.82, so the dataset veers slightly toward the acidic end of the usual wine pH spectrum.

chlorides

In most wines, the chloride concentration is below 50mg/l, expressed in sodium chloride. It may exceed 1g/l in wine made from grapes grown by the sea.

Sodium chloride is sometimes added during fining, especially when egg whites are used.- Handbook of Enology, The Chemistry of Wine: Stabilization and Treatments

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Positively skewed, with a particularly long right tail.

Not much more info here. Will try a boxplot.

Transforming the x-axis to log10 shows that chloride values appear discrete and then become continuous as chloride levels reach 0.3 g/l (300 mg/l) and a bit beyond. This is higher than normal, but still a far cry from 1 g/l.

I suspect the region’s notorious rain and mist in wintertime result in higher-than-usual salinity in the soil of coastal subregions, which is then absorbed by the grapes.

Density

Density is what makes wine feel full-bodied. Since VV wines are known for their lightness, I’d expect this dataset to be lower-density than your typical white wine dataset (if I had any to compare it to).

Higher density in wine is usually a result of higher sugar or lower alcohol content, since dissolved sugar makes the wine denser than water while ethanol makes it less dense.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Not surprisingly, since it’s related to sugar & alcohol, density’s distribution is a bit lumpy, but less so than those of residual.sugar and alcohol. It also has two extreme positive outliers, just like residual.sugar.

This doesn’t help much.

residual.sugar

Residual sugar, or lack thereof, in wines can be a sign of a flaw - secondary fermentation. For VV wines, this is considered a feature rather than a flaw.

Outside of Champagne, secondary fermentation in the bottle is a serious problem for winemakers, and one that calls for careful precautions. Besides generating an unpleasant effervescence (bubbly isn’t always better, hate to say), the secondary fermentation cuts into the residual sugars and unbalances the wine. But it’s even worse when the dormant yeast wakes up and starts eating up the acids in the wine.

This is called malolactic fermentation, and if it sounds familiar it’s because it is what gives new world Chardonnay that creamy, buttered-toast flavor. Unwanted malo is usually a serious concern, especially in white wines that rely on acidity for balance and texture, but the winemakers in Minho found that the ensuing slight fizziness caused by this flaw actually made the wine more palatable.-badass-sommelier-lets-drink-some-vinho-verde

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

This distribution looks like a swan, where the dramatic mode around 2 is the neck and then a lumpy tail as its body.

Let’s look at this using a log10 x-axis.

There are a few extreme outliers above 20. The distribution now looks like a camel: it has two humps. Are there two populations within this data set?

If we subset the dataset by residual.sugar value, will it also change the other odd-looking distributions, those for alcohol and density?

whites.dry <- whites %>%
  filter(residual.sugar <= 3)

nrow(whites.dry)
## [1] 1885
qplot(residual.sugar, data=whites.dry, binwidth=0.2, fill=I('pink'))

How did this change the distribution for quality, alcohol, and density?

For drier VV wines, the distributions for alcohol and density look considerably closer to a normal distribution. Even quality’s distribution looks more symmetric.

Now for the sweeter wines:

whites.sweet <- whites %>%
  filter(residual.sugar > 3)

nrow(whites.sweet)
## [1] 3013
qplot(residual.sugar, data=whites.sweet, binwidth=1, fill=I('pink'))

qplot(residual.sugar, data=whites.sweet, binwidth=1, 
      fill=I('pink'), color=I('white')) +
      scale_x_continuous(limits=c(0,25))

As with whites.dry, the distribution for quality looks even more symmetric than for the full dataset. The shapes of the distributions for alcohol and density, on the other hand, look virtually identical to those of the full dataset.

Univariate analysis

What is the structure of your dataset?

There are 4898 rows, and 13 variables. All variables are numeric. In order to have at least one categorical variable, I converted quality to a factor with 7 levels.

Most wine is of average quality, a 5 or a 6. The mean quality score is 5.88.

alcohol ranges from 8% to 14.2%. The most common alcohol content is 9.5%, but the median is 10.4%, very close to the mean of 10.51%.

Wine with a residual sugar content greater than 45 g/l is considered sweet. Only one wine in the dataset would therefore be considered sweet; it had the maximum value of 65.8 g/l. Most wines are far below this. Average residual.sugar content is 6.39 g/l, while the median is a considerably lower 5.2 g/l.

What is/are the main feature(s) of interest in your dataset?

The main feature I’m interested in is alcohol. A VV wine’s alcohol content determines how specific a region can be named on its label. Usually, the more specific the location of the wine, the higher the price it commands.

  • Generic VV wine is between 8% and 11.5%.
  • VV wine from one of the subregions (which limits which grape varieties are used in its wines) is between 9% and 14%.
  • Single-varietal Alvarinho is between 11.5% and 14%.

I suspect as the region/varietal becomes more specific, its quality will go up.

The other key feature is residual.sugar. It plays a major part in flavour and density. Lower values could indicate secondary fermentation, which could either mean a wine that is pleasantly fizzy, or one that is more round and buttery instead of sharper and acidic. Or else it could be that it’s failed to achieve either and is simply unbalanced.

Chlorides are of interest because several VV wine reviews I read online spoke, in positive terms, of VV wines with noticeable hints of salt. This seems strange, but in my research I’ve learned that sodium chloride in wine (as opposed to other forms of sodium) creates a soapy taste. While this doesn’t sound appealing, it’s true that salt is often used to contrast sweetness in chocolate and caramel.

More relevant to the data at hand, I’ve myself noticed a pleasant and very subtle salty flavour in Txakoli wine, which has been compared to VV as it’s from the Spanish Basque Country, a similarly lush, wet region by the Atlantic.

Lastly, fixed.acidity is of interest because one of the classic ways to describe a wine is “a nice balance of sweetness and acidity” or else a “nice balance of alcohol and acidity”. Also, these are considered green wines: youthful, sharp, clean. All features I tend to associate with more acidity.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  • density, because it provides the body, or “mouthfeel” for wine, which is also something you often hear discussed in wine reviews. It is also strongly related to sugar and alcohol.

  • higher values of volatile.acidity could indicate wines that have oxidized and so help identify lower-quality wines.

Did you create any new variables from existing variables in the dataset?

Nope.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes: residual.sugar and alcohol, and to a lesser extent, density.

Transforming residual.sugar with log10 revealed a bimodal distribution with a mode at 2, and another mode around 10. The second mode is harder to read. The density plot actually shows two bumps on this second mode.

To see if these represented two distinct wine populations within the dataset, I created two subsets, whites.dry and whites.sweet, with a residual.sugar value of 3g being the dividing line.
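A small sketch of the split itself, in base R rather than dplyr. The `toy` data frame is invented for illustration; the two row counts at the end are the ones reported earlier for the real subsets.

```r
# Toy residual.sugar values in g/l (invented for illustration)
toy <- data.frame(residual.sugar = c(1.2, 1.6, 3.0, 3.1, 6.9, 8.5, 20.7))

# Same rule as the filters used earlier: <= 3 is "dry", > 3 is "sweet"
toy_dry   <- toy[toy$residual.sugar <= 3, , drop = FALSE]
toy_sweet <- toy[toy$residual.sugar >  3, , drop = FALSE]

# Every row lands in exactly one of the two subsets
nrow(toy_dry) + nrow(toy_sweet) == nrow(toy)

# The row counts reported for the real subsets add up to the full data set
n_dry   <- 1885
n_sweet <- 3013
n_dry + n_sweet                            # 4898
round(100 * n_dry / (n_dry + n_sweet), 1)  # share of "dry" wines, in percent
```

Because the two conditions are complementary, a wine with exactly 3 g/l goes to the dry subset and nothing is counted twice or dropped.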

When I replotted the distributions for some key variables, those for whites.dry appeared more symmetrical than in the full dataset. Those for whites.sweet, however, didn’t noticeably change shape.

I would like to compare the correlations for the full dataset and the two subsets.

Bivariate plots

Which variables have the highest correlation with quality?

Correlations

First we’ll need to create samples of each of the datasets.

# whites

whites.subset <- subset(whites, select = -X)  # drop the row index to improve readability
# quality is a factor: go through as.character() to recover the scores (3 to 9)
whites.subset$quality <- as.numeric(as.character(whites.subset$quality))

# create a sample of 1000 rows so it's faster to run
set.seed(888)
whites_samp <- whites.subset[sample(nrow(whites.subset), 1000), ]

# whites.dry

whites.dry <- subset(whites.dry, select = -X)
whites.dry$quality <- as.numeric(as.character(whites.dry$quality))

set.seed(888)
whites.dry_samp <- whites.dry[sample(nrow(whites.dry), 1000), ]

# whites.sweet

whites.sweet <- subset(whites.sweet, select = -X)
whites.sweet$quality <- as.numeric(as.character(whites.sweet$quality))

set.seed(888)
whites.sweet_samp <- whites.sweet[sample(nrow(whites.sweet), 1000), ]
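One thing to watch when turning a factor back into numbers: as.numeric() applied directly to a factor returns the internal level codes (1 to 7 here), whereas going through as.character() first recovers the scores (3 to 9). The sketch below uses a made-up quality factor and a hypothetical second variable `x` to show that, because the levels are consecutive, the correlation comes out the same either way.

```r
# Toy quality factor with the same seven levels as the wine data
q <- factor(c(3, 5, 6, 6, 7, 9), levels = 3:9)

codes  <- as.numeric(q)                # internal level codes
scores <- as.numeric(as.character(q))  # original quality scores

codes
scores

# Hypothetical second variable, only to compare the two correlations
x <- c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4)

cor(codes, x)
cor(scores, x)
all.equal(cor(codes, x), cor(scores, x))
```

The codes are simply the scores minus 2, and Pearson correlation is unaffected by adding a constant. The distinction would matter for anything that depends on the actual values, such as a mean quality score.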

Correlation matrix for whites

ggpairs(whites_samp, axisLabels = 'internal',
        lower = list(continuous = wrap("smooth", alpha=0.2, shape = I('.'), color='green')), 
        upper = list(combo = wrap("box", outlier.shape = I('.'))))

The 3 attributes with the highest correlations with quality are:

  • alcohol: 0.451479
  • density: -0.3273512
  • chlorides: -0.220217

Correlation matrix for whites.dry

The strongest quality correlations for whites.dry are:

  • alcohol: 0.4272068
  • density: -0.401994
  • volatile.acidity: -0.2307222
  • residual.sugar: 0.2010104

While alcohol’s correlation is only slightly lower, density’s negative correlation has gotten stronger. The biggest change is that, for drier VV wines, volatile.acidity (negative) and residual.sugar (positive) now rank among the strongest correlations. In the full dataset, the third and fourth strongest correlations were chlorides and total.sulfur.dioxide (which was only -0.17).

This is especially notable for residual.sugar, since it had a very weak negative correlation with quality in the full dataset.

Correlation matrix for whites.sweet

For whites.sweet, the strongest quality correlations are:

  • alcohol: 0.4833503
  • density: -0.3311279
  • chlorides: -0.2504071
  • total.sulfur.dioxide: -0.2182216

These are the same top four correlating variables as in the full whites data frame, which is not surprising considering that the shapes of the distributions for alcohol, density, residual.sugar, and quality are virtually identical for whites and whites.sweet.
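To compare the three sets of numbers side by side, here is a small sketch that types in the alcohol and density correlations reported above (rounded to three decimals) and computes how each subset differs from the full dataset. The data frame `cors` and its column names are my own.

```r
# Correlations with quality, copied from the results above (rounded)
cors <- data.frame(
  variable = c("alcohol", "density"),
  full     = c(0.451, -0.327),
  dry      = c(0.427, -0.402),
  sweet    = c(0.483, -0.331),
  stringsAsFactors = FALSE
)

# Change relative to the full dataset
cors$dry_minus_full   <- cors$dry - cors$full
cors$sweet_minus_full <- cors$sweet - cors$full

# Which dataset shows the strongest (largest absolute) correlation?
abs_cors <- abs(as.matrix(cors[, c("full", "dry", "sweet")]))
cors$strongest <- colnames(abs_cors)[max.col(abs_cors, ties.method = "first")]

cors
```

On these numbers, alcohol correlates most strongly with quality in whites.sweet and density in whites.dry, while the whites.sweet values stay close to those of the full dataset.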

So which dataset do I use for the rest of the analysis?

TBD.

Ideally, I would like to use whites.dry. I suspect the stronger correlations for density and chlorides in whites.sweet might be due to the extreme positive outliers having more weight in that subset. In whites.dry we lost all the positive outliers for residual.sugar and the closely related density (and perhaps for others too).